
Mekhanik evgenii/1196 #17

Merged
merged 9 commits into from
Jun 27, 2024

Conversation

EvgeniiMekhanik
Contributor

No description provided.

@EvgeniiMekhanik EvgeniiMekhanik marked this pull request as draft July 31, 2023 10:10
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196 branch 2 times, most recently from 46d5070 to 45138c4 Compare August 8, 2023 15:47
@EvgeniiMekhanik EvgeniiMekhanik marked this pull request as ready for review August 10, 2023 14:02
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196 branch 4 times, most recently from 9660c99 to 042a7d9 Compare November 23, 2023 17:54
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196 branch 4 times, most recently from 67f5e65 to fe13b02 Compare December 25, 2023 07:23
if (tcp_send_head(sk)) {
	struct tcp_sock *tp = tcp_sk(sk);

#ifdef CONFIG_SECURITY_TEMPESTA
	if (mss_now != 0)
		__tcp_push_pending_frames(sk, mss_now, tp->nonagle);
Contributor

Can we initialize mss_now using tcp_current_mss() and get rid of the mss_now != 0 check?

Contributor Author

I thought about it, but we don't need to calculate mss_now at all if SOCK_TEMPESTA_HAS_DATA is not set and tcp_send_head is empty. I don't want to do extra work in that case.

@const-t
Contributor

const-t commented Jan 31, 2024

tso_fragment() is not used by Tempesta; we must move it back to static scope.

@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196 branch from 8d02a09 to 13850d4 Compare May 1, 2024 04:06
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196 branch 2 times, most recently from b198d70 to 9472c4f Compare May 10, 2024 09:44
Contributor

@krizhanovsky krizhanovsky left a comment

LGTM

@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196 branch 2 times, most recently from 437c1ef to 6bc466c Compare May 27, 2024 16:53
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196 branch 4 times, most recently from fdab4d7 to d1aaa66 Compare June 4, 2024 12:14
Contributor

@krizhanovsky krizhanovsky left a comment

LGTM

struct tcp_sock *tp = tcp_sk(sk);
unsigned int in_flight = tcp_packets_in_flight(tp);
unsigned int send_win, cong_win;
unsigned int limit;
int result;

if (!sk->sk_write_xmit || !skb_tfw_tls_type(skb))
Contributor

Suggested change:
-	if (!sk->sk_write_xmit || !skb_tfw_tls_type(skb))
+	if (unlikely(!sk->sk_write_xmit || !skb_tfw_tls_type(skb)))

Contributor

@const-t const-t left a comment

From my point of view, this patch has some points to improve.

Let's start with a discussion of the goals. As I understand it, our goal is to keep as little data as possible in the write_queue. For instance, if the client is able to receive 1024 bytes, we must place an skb of 1024 bytes into the write_queue and send it in one shot, avoiding any segmentation. In other words, tso_fragment must not be called at all.

However, this patch works a little differently. The first problem is described in the comment marked (2). The second problem is that ss_calc_snd_wnd() calculates the window in a wrong way: it completely ignores the TCP segmentation limits (see tcp_tso_segs()) and computes the cwnd quota without taking the minimum of half-cwnd and the remaining window (see tcp_cwnd_test()). The second problem leads to segmentation when the interface supports only a small number of TSO segments, or when the congestion control module limits the number of TSO segs: we then place into the queue a big skb whose size is bigger than max_segs.

Just an assumption: maybe we could cache the limit from ss_calc_snd_wnd() and reuse it in tcp_write_xmit, but that's only a theory.

struct tcp_sock *tp = tcp_sk(sk);
unsigned int in_flight = tcp_packets_in_flight(tp);
unsigned int send_win, cong_win;
unsigned int limit;
Contributor

(1) Why do we recalculate the limit here? We already have the limit calculated in tcp_write_xmit; it is calculated correctly and we can use it.

Contributor Author

Because we want to make one TLS record from several skbs. For example, if the MTU is 1500, our previously calculated limit is also about 1500; but if there are several skbs in the socket write queue, we can encrypt them all into a single TLS record, instead of calling tls_encrypt several times.

__func__, mss_now);
break;
}
TFW_ADJUST_TLS_OVERHEAD(limit);
Contributor

(2) This looks suboptimal. We calculate a limit, then TFW_ADJUST_TLS_OVERHEAD reduces the limit by the TLS overhead; the limit becomes lower than skb->len, so tso_fragment will split this skb, which is unnecessary overhead. Afterwards, in tfw_tls_encrypt, we will encrypt both skbs, but the last one will hang in the write_queue anyway.

Contributor Author

@EvgeniiMekhanik EvgeniiMekhanik Jun 13, 2024

Yes; because we encrypt several skbs, we don't need to adjust for the overhead.

struct {
	__u8 present : 1;
	__u8 tls_type : 7;
	__u16 flags : 16;

We still need flags to notify when an skb is transmitted, e.g. tempesta-tech/tempesta#2108.
Of course, on_tcp_entail could do this as well, but it is not currently used for this purpose, at least not for all control frames.

Contributor Author

We should implement an additional on_send_ callback in Tempesta for #2108, not use flags in the skb. Moreover, we removed the additional callback from the kernel code, so #2108 should be reworked.

@kingluo kingluo Jun 13, 2024

No, it can't be in the on_send callback: that only indicates entry into the socket's write queue, but the counter decrement must happen at the time of actual transmission, because only at that point is the skb handed to the TCP transmission path, i.e. tcp_write_xmit or its callers, e.g. __tcp_push_pending_frames, rather than accumulated in kernel memory for some reason, such as congestion or a zero TCP window from a malicious peer. This is exactly the purpose of the counter: to avoid control-frame flooding.

That is, for #2108 the solution is either to keep the counter decrement in sk_write_xmit, in which case the tfw_cb.flag is still needed, or to manipulate the counter directly in ss_skb_on_tcp_entail. No additional callbacks are needed here.

Contributor Author

ss_skb_on_tcp_entail is called before pushing the skb to the socket write queue. It is a good place to decrement the counter.

Contributor Author

Applications that work in user space and count control frames don't know anything about __tcp_push_pending_frames, and there is no problem for them.

Contributor Author

I think we can postpone such frames if really necessary (and send them only when we have enough window) and count them in ss_skb_on_tcp_entail. We are moving tls_encrypt into Tempesta code, so there is no sense in having one more callback in the Linux kernel only to count control frames. User-space applications like nginx don't know anything about tcp_transmit_skb; how would they count control frames?

@kingluo kingluo Jun 13, 2024

> User-space applications like nginx don't know anything about tcp_transmit_skb; how would they count control frames?

@EvgeniiMekhanik @krizhanovsky @const-t

I have to update some thoughts.

No: a user program relies on the socket buffer to determine whether the skb send is in flight. If the sockbuf is full, the send() system call fails because the peer does not provide enough window (or, on a non-blocking fd, the write event is shelved); the counter then accumulates, and the user program knows it is under a flood attack.

(Moreover, a user program can use the TCP_INFO getsockopt to retrieve TCP information and learn the in-flight bytes. In fact, nginx uses it to define some variables, although not exactly for our purpose.)

However, in our case it's a completely different story. Tempesta does not use the sockbuf but enqueues directly to the socket's sk_write_queue, so it has memory risks unless we know that the skb has started to be transmitted, i.e. is in flight. How do we know this? The only place is where tcp_transmit_skb successfully handles the skb, which is why I suggest applying the counting logic there. Neither on_send (called in ss_tx_action), nor entail (corresponding to the original prepare_xmit), nor even the write_xmit callback can meet our need to mitigate control-frame flood attacks. This is not a theory; it has been verified through testing.

And the method suggested by @EvgeniiMekhanik does not work either: similar to entail, pushing skbs when the window is available does not mean that all of them can be sent immediately, because we do not know how large the network capacity is or how many skbs can actually be sent, so some skbs will likely be held in memory for an undetermined time.

Contributor Author

We call ss_skb_on_tcp_entail just before we push the skb to the socket write queue. For HEADERS and DATA frames we calculate the TCP window in sk_fill_write_queue and don't call ss_skb_tcp_entail->ss_skb_on_tcp_entail if there is not enough window. To solve the problem with control frames, we should likewise not call ss_skb_tcp_entail->ss_skb_on_tcp_entail for all other frame types when there is not enough TCP window. Instead we should push them to a queue of postponed frames and move them to the socket write queue only when we have enough TCP window. In that case we can count control frames in ss_skb_on_tcp_entail.

@kingluo kingluo Jun 17, 2024

Fair enough. This is sufficient if we do not care about strict accuracy regarding whether the skb was successfully sent by tcp_transmit_skb or was even removed from tcp_rtx_queue.

Contributor

@krizhanovsky krizhanovsky Jun 25, 2024

I agree that the flags were replaced with the on_send callback in #1973, so it's not desirable to keep the flags, especially with only one user. Meantime, I also don't like the idea of additional queueing, since it wastes memory, decreases performance, and opens opportunities for DoS attacks.

However, in the first place I didn't get what we fix with #2108, so I requested changes. Maybe I'm wrong; let's discuss the fix, but I proposed controlling and limiting flows to the client in tempesta-tech/tempesta#498 (comment).

P.S. @kingluo @EvgeniiMekhanik, thank you for inviting me to the discussion.

unsigned int limit,
unsigned int skbs);
unsigned int limit);
int (*sk_fill_write_queue)(struct sock *sk,

I think we should document this breaking change, explaining why sk_prepare_xmit is deprecated and why the new sk_fill_write_queue requires somewhat complicated limit calculations. If not, perhaps even the author himself will forget about it after a few weeks. It's better to document it in the code as a comment, because tempesta-tech/tempesta#1196 only discusses a theoretical design (which is even outdated after the code rewrite), not the actual implementation.

Contributor

Yeah, it could make sense to document both callbacks: when they are called and what they do.

@kingluo kingluo Jun 17, 2024

Personally, I don't quite understand why we moved framing from prepare_xmit to tcp_push_pending_frames, which is the biggest change in this PR. In prepare_xmit we already know that TCP has enough window to send data, but in tcp_push_pending_frames we have to calculate it ourselves, which seems like duplicated work. Could you clarify this further in comments?

Contributor Author

Yes, I will add a comment before merging the PR. There were several problems with the previous approach:

  1. We push the response skb directly to the socket write queue. If the skb belongs to a stream that is blocked because the HTTP/2 stream window is exceeded, we can't send this skb to the client, and it blocks all other skbs in the socket write queue until a window update comes.
  2. We can't prioritize streams. We push responses to the socket write queue directly, so there is no prioritization. Now we push response skbs to our private scheduler queue, then call tcp_push_pending_frames, where we run our scheduler to choose the highest-priority stream and send data according to the TCP window. tcp_push_pending_frames is also called from tcp_data_snd_check when we receive data from the peer and the TCP window opens again.

@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196 branch 6 times, most recently from 18581a9 to 9092882 Compare June 24, 2024 13:38
There were several problems with the previous approach:
 - We push the response skb directly to the socket write queue.
   If the skb belongs to a stream that is blocked because the
   HTTP/2 stream window is exceeded, we can't send this skb to
   the client, and it blocks all other skbs in the socket write
   queue until a window update comes.
 - We can't prioritize streams. We push responses to the socket
   write queue directly, so there is no prioritization.

Because of the framing rework in Tempesta FW,
make some changes in the Linux kernel:
- Remove the unused `sk_prepare_xmit` callback.
- Implement the new `sk_fill_write_queue` callback,
  which is used to move skbs from the Tempesta private
  write queue to the socket write queue.
- Add the new `SOCK_TEMPESTA_HAS_DATA` socket flag, which
  indicates that there is some data in the Tempesta FW
  write queue.

Now we push response skbs to our private scheduler queue.
Then we call `tcp_push_pending_frames`, where we run our
scheduler to choose the highest-priority stream and send
data according to the TCP window. `tcp_push_pending_frames`
is also called from `tcp_data_snd_check` when we receive
data from the peer and the TCP window opens again.
Since we no longer use skb private data in the xmit
callbacks, we can use `skb->cb` to save skb private
data for our own purposes.
- Codestyle fixes; move `tso_fragment` back to
  static scope.
- Close the socket immediately when the new
  `sk_fill_write_queue` socket callback returns an
  error.
- `sk_fill_write_queue` returns a positive value
  if there is unsent data in our scheduler.
Use the minimum of the TCP sender window,
the TCP receiver window, and the TLS max payload
size to calculate the size of a TLS record,
instead of using the MSS for this purpose.
Pass the TCP_NAGLE_OFF | TCP_NAGLE_PUSH flags when
we call `__tcp_push_pending_frames` in kernel
code, as we do when we call it from Tempesta FW
source code.
Export `tcp_cwnd_restart` to be able to
call `tcp_slow_start_after_idle_check` from
Tempesta FW code.
We should check `seg_size` in `tcp_write_wakeup`
only when it is used as an skb->len limit.
In Tempesta FW code we need to implement
different behaviour depending on where
`sk_fill_write_queue` is called from. (We need to
send `tcp_shutdown` if this function is called
from `ss_shutdown`, and not send it if it is
called from any other place.)
There are a lot of places in our source code
which are not yet clear.
@EvgeniiMekhanik EvgeniiMekhanik merged commit 09f6e2a into main Jun 27, 2024
@EvgeniiMekhanik EvgeniiMekhanik deleted the MekhanikEvgenii/1196 branch June 27, 2024 13:52